Make is_valid_syntax parts lazy loaded#14
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to reduce import-time overhead for rfc3987_syntax.syntax_helpers by lazily initializing the Lark grammar/parser and validator helpers, targeting faster startup for consumers that import the module indirectly (e.g., via jsonschema).
Changes:
- Lazily load grammar and the main
Larkparser via@lru_cache-backed getters. - Replace eagerly-created
is_valid_syntax_*module globals with module-level__getattr__that creates validators on demand. - Fix a missing comma in
RFC3987_SYNTAX_TERMS.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| term = name.removeprefix("is_valid_syntax_") | ||
| return make_syntax_validator(term) | ||
| msg = f"Attribute {name} not found" | ||
| raise AttributeError(msg, name=name) |
There was a problem hiding this comment.
raise AttributeError(msg, name=name) is not a valid way to construct AttributeError (it doesn't accept keyword arguments). This will raise a TypeError when an unknown attribute is accessed. Raise AttributeError(msg) (or AttributeError(msg, name) positionally if you want .name set) instead.
| raise AttributeError(msg, name=name) | |
| raise AttributeError(msg) |
| return get_syntax_parser() | ||
| if name.startswith("is_valid_syntax_"): | ||
| term = name.removeprefix("is_valid_syntax_") | ||
| return make_syntax_validator(term) |
There was a problem hiding this comment.
__getattr__ currently treats any is_valid_syntax_* attribute as valid and passes the suffix straight into make_syntax_validator(). Combined with @lru_cache(maxsize=None) this allows unbounded creation/caching of parsers for arbitrary attribute names (and can also surface confusing Lark configuration errors for typos). Consider checking term against RFC3987_SYNTAX_TERMS and raising AttributeError for unknown terms.
| return make_syntax_validator(term) | |
| if term in RFC3987_SYNTAX_TERMS: | |
| return make_syntax_validator(term) |
| __all__ = [ | ||
| "is_valid_syntax", | ||
| ] |
There was a problem hiding this comment.
There are behavior changes here around lazy module attributes (grammar, syntax_parser, is_valid_syntax_*) and around what gets exported at package import time. Existing tests only exercise is_valid_syntax via import rfc3987_syntax as h, so regressions like missing re-exports (parse, RFC3987_SYNTAX_TERMS, is_valid_syntax_iri, etc.) or __getattr__ behavior won't be caught. Add tests that assert these public attributes are available and callable after importing rfc3987_syntax.
| __all__ = [ | ||
| "is_valid_syntax", | ||
| ] |
There was a problem hiding this comment.
Defining __all__ as only ['is_valid_syntax'] changes what from .syntax_helpers import * exports. Since src/rfc3987_syntax/__init__.py uses a star import, this will stop re-exporting parse, RFC3987_SYNTAX_TERMS, and the documented is_valid_syntax_* helpers (breaking the public API/README examples). Consider removing the star import from __init__.py and explicitly re-exporting symbols (possibly via package-level __getattr__), or otherwise ensuring the previous public surface remains available without eagerly constructing all validators on import.
This PR is about optimizing performance of the main module import (syntax_helpers one). I believe there are other PRs with the same goal but my fork has been already tested in real environment
Why do I care
So, I develop a big data platform with orchestration based on airflow, and some of the airflow components start painfully slow. Specifically, api server and scheduler. And because of painfully slow startup sometimes on smaller environments I experience constant service restartings after re-deploys cause it cannot make initial start on time (which pisses me off infinitely)
When I performed profiling of the startup of those components - I've noticed that a lot of time was spent inside initial loading of
jsonschemapackage which I believe airflow uses for enforcing their SDK API schema formatThen digging deeper I found that what's actually making things stuck is this package. Airflow spins up N parallel processes and they all go and try to import jsonschema which leads to importing this package N times on pretty small machine and it leads to freezes that cause k8s to kill them and reinitiate
What did I do
I just put everything that made sense under cached functions and added module-level getter for original variables and it shrinked loading time by a lot. I already tested it in my platform and it seems working just fine with
jsonschemapackage. I don't know if any other library uses it differently but I tested what I cared for, the optimized version has been deployed to the dev environment of my platform and worked just fine. I also fixed a typo in RFC3987_SYNTAX_TERMS var (it lacked comma between terms)I originally thought to put all the names into
__all__but then asterisk import from__init__actually forces to load everything and it basically nukes all my lazy-loading efforts. Maybe it makes sense to put all the original names somewhere cause RFC3987_SYNTAX_TERMS variable actually lacks some of the function names that were presented in a moduleConclusion
If you want to me to change anything in the lazy load approach - pls let me know, I try to fit it in. Anyways, I have a good opportunity to test if I break anything in real env. I hope you find this PR good enough so we can have an optimized module loading. I believe a lot of people are pissed off because of such very deep dependency's loading time.
Thanks for bringing this library to the world anyways. I made these changes without AI and tested them manually and deployed on real environment. Yeah, it's dev environment but still is pretty real one